The purpose of this effort was to study visual, focal selective attention and
its implementation in the primate visual system from a computational point of
view. It is known that at the neuronal level, two cortical pathways exist
that are responsible for mediating attention: the where pathway that selects
interesting or conspicuous locations and the what pathway that identifies and
recognizes objects. In the effort, it was shown how neuronal networks based
on those found in the cerebral cortex can implement these pathways using
real images. In particular, the use of a saliency map that encodes how
"interesting" or "salient" locations are in the visual field (rather than what
features are present at these locations) represents a powerful strategy to aid
visual search. These algorithms are being ported onto Pentium-based machines
for various machine-vision applications.
Technical Progress


The purpose of this effort was to study visual, focal selective attention and
its implementation in the primate visual system from a computational
point of view. It is known that at the neuronal level, two cortical pathways
exist that are responsible for mediating attention: the where pathway that
selects interesting or conspicuous locations and the what pathway that
identifies and recognizes objects. In the effort, it was shown how neuronal
networks based on those found in the cerebral cortex can implement
these pathways using real images. In particular, the use of a saliency
map that encodes how "interesting" or "salient" locations are in the visual
field (rather than what features are present at these locations) represents a
powerful strategy to aid visual search. These algorithms are being ported
onto Pentium-based machines for various machine-vision applications.


Introduction 


The  computations  of  early  vision  are  essentially  parallel  operations,  i.e., 
they  are  applied  in  parallel  to  all  parts  of  the  visual  field.  This  high  degree 
of  parallelism  cannot  be  sustained  in  intermediate  and  higher  vision 
because of the astronomical number of different possible combinations of
features.  Therefore,  it  becomes  necessary  to  select  only  a  part  of  the 
instantaneous  sensory  input  for  more  detailed  processing  and  to  discard 
the  rest.  This  is  the  mechanism  of  visual  selective  attention  which  we  set 
out  to  study  by  computational  methods. 

Primate  vision  is  organized  along  two  major  anatomical  pathways.  One  of 
them  is  concerned  mainly  with  object  recognition.  For  this  reason,  it  has 
been  called  the  What  pathway;  for  anatomical  reasons,  it  is  also  known  as 
the  ventral,  or  occipito-temporal,  pathway.  The  principal  task  of  the  other 
major  pathway  is  the  determination  of  the  location  of  objects  and  therefore 
it  is  called  the  Where  pathway  or,  again  for  anatomical  reasons,  the 
dorsal,  or  occipito-parietal,  pathway. 


Most of our work during the first part of the grant period was devoted to
the development of a model for the implementation of the What pathway. The
underlying mechanism is "temporal tagging": it is assumed that the
attended region of the visual field is distinguished from the unattended
parts by the temporal fine-structure of the neuronal spike trains. We have
shown that temporal tagging can be achieved by introducing oscillations
(Niebur, Koch and Rosin, 1993) or moderate levels of correlation among
groups of cells (Niebur and Koch, 1994) and that the tag thus generated
may be available at all stages of the perceptual hierarchy.

How can such temporal modulation be obtained? Periodicity can be
generated by subthreshold additional input. To generate synchronous
activity, we have suggested a simple mechanism, namely a very brief
common input to all cells which respond to attended stimuli. Such
(excitatory) input will increase the propensity of postsynaptic cells to fire for
a very short time after receiving this input, and thereby increase the
correlation between spike trains without necessarily increasing the average
firing rate.
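
The effect of a brief common input on spike-train correlation can be
illustrated with a small simulation. The following is only a sketch under
stated assumptions and not the model used in the cited studies: spike trains
are binary sequences in discrete time bins, the function simulate_pair and
all of its parameters are hypothetical, and the baseline firing probability
is lowered by hand so that the mean rate stays fixed while the common pulses
are added.

```python
import random


def simulate_pair(n_bins, rate, pulse_prob=0.0, pulse_fire_prob=0.5, seed=0):
    """Simulate two binary spike trains that share brief common pulses.

    Returns (rate_a, rate_b, correlation). All parameter names are
    illustrative assumptions, not values taken from the report.
    """
    rng = random.Random(seed)
    # Assumption: lower the baseline so the mean firing probability per bin
    # remains `rate` even when common pulses are present.
    base = (rate - pulse_prob * pulse_fire_prob) / (1.0 - pulse_prob)
    if base < 0.0:
        raise ValueError("pulses alone already exceed the requested rate")
    a, b = [], []
    for _ in range(n_bins):
        # A pulse bin briefly raises the firing probability of BOTH cells.
        p = pulse_fire_prob if rng.random() < pulse_prob else base
        a.append(1 if rng.random() < p else 0)
        b.append(1 if rng.random() < p else 0)
    mean_a = sum(a) / n_bins
    mean_b = sum(b) / n_bins
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n_bins
    var_a = sum((x - mean_a) ** 2 for x in a) / n_bins
    var_b = sum((y - mean_b) ** 2 for y in b) / n_bins
    corr = cov / (var_a * var_b) ** 0.5
    return mean_a, mean_b, corr


if __name__ == "__main__":
    print("untagged:", simulate_pair(50000, 0.05, pulse_prob=0.0, seed=1))
    print("tagged:  ", simulate_pair(50000, 0.05, pulse_prob=0.02, seed=1))
```

With these hypothetical settings, the two "tagged" trains should show a
clearly higher correlation than the "untagged" ones while both keep roughly
the same mean rate, which is the property the tagging mechanism relies on.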

Over  the  last  year,  we  have  developed  a  model  of  the  control  system  which 
generates  such  modulating  input.  We  have  shown  that  it  is  possible  to 
construct  an  integrated  system  of  attentional  control  based  on  neuronally 
plausible  elements  and  which  is  compatible  with  the  anatomy  and 
physiology  of  the  primate  visual  system.  The  system  scans  a  visual  scene 
and  identifies  its  most  salient  parts. 

We  have  also  pursued  the  question  of  the  neuronal  implementation  of 
feature-based attention. We showed (Usher and Niebur, 1995) that the
dynamical response of cortical IT neurons in a delayed-match-to-sample
task can be understood as an attentional modulation which is mediated by a
simple  feedback  mechanism  involving  prefrontal  cortex.  This  study 
complements  our  work  on  location-based  attention  and  generalizes  our 
model  to  incorporate  those  attentional  phenomena  which  do  not  depend  on 
the  activity  in  a  saliency  map. 

Finally,  in  collaboration  with  Dr.  Francis  Crick  at  the  Salk  Institute  we 
have  continued  to  pursue  the  question  of  the  neuronal  correlate  of 
awareness (Crick and Koch, 1995). More specifically, in which neurons and in
which brain areas do attention and the content of visual awareness arise?
We conclude that neurons that mediate visual awareness must directly
project to the planning stages of the brain, that is, to prefrontal cortical
areas. We therefore conclude that the firing of neurons in primary visual
cortex, while necessary for most of conscious vision, does not directly cause
visual  awareness.  This  proposal  has  caused  a  lot  of  controversy,  with  a 
number  of  laboratories  actively  working  on  evaluating  its  experimental 
status.  Indeed,  one  such  psychophysical  study,  originating  in  our 
laboratory,  has  just  appeared  in  Nature  (Kolb  and  Braun,  1995). 

Because our model of the dorsal Where pathway has interesting
implications for the manner in which computer-vision systems allocate
their computational resources (see also our "Future outlook" section), we
will describe it in more detail.

A  Simple  Model  of  The  Dorsal  Pathway 

Overall  Structure 

Figure 1 shows an overview of the model Where pathway. Input is provided
in the form of digitized images from an NTSC camera, which are then
analyzed in various feature maps. These maps are organized around the
known operations in early visual cortices. They are implemented at
different spatial scales, and each feature is computed in a center-surround
structure akin to visual receptive fields (Abelson et al., 1984). The features
implemented  so  far  are  the  three  principal  components  of  primate  color 
vision  (intensity,  red-green,  blue-yellow),  four  orientations,  and  temporal 
change. 
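
A center-surround feature computed at several spatial scales can be
sketched as follows. This is a minimal illustration, not the report's
implementation: a simple box average stands in for the smoothing filter, the
image is a list of rows of intensities, and the function names as well as
the (center, surround) radii in multiscale_feature_maps are hypothetical.

```python
def box_blur(img, radius):
    """Mean over a (2*radius+1)^2 window, clipped at the image border."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def center_surround(img, center_radius, surround_radius):
    """Absolute difference between a fine (center) and a coarse (surround) average."""
    center = box_blur(img, center_radius)
    surround = box_blur(img, surround_radius)
    return [[abs(c - s) for c, s in zip(c_row, s_row)]
            for c_row, s_row in zip(center, surround)]


def multiscale_feature_maps(img, scales=((0, 2), (1, 4), (2, 6), (3, 8))):
    """One center-surround map per (center, surround) radius pair (assumed values)."""
    return [center_surround(img, c, s) for c, s in scales]


if __name__ == "__main__":
    image = [[0.0] * 15 for _ in range(15)]
    image[7][7] = 1.0                      # a single bright spot
    finest = multiscale_feature_maps(image)[0]
    print(finest[7][7], finest[0][0])      # strong response at the spot only
```

A uniform image gives a zero response everywhere, whereas an isolated
bright spot responds strongly at its own location, which is the behavior
expected of a center-surround operator.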


Target  Selection  and  the  Saliency  Map 

The  task  of  the  saliency  map  is  the  computation  of  the  salience  at  every 
location  in  the  visual  field  and  the  subsequent  selection  of  the  most  salient 
areas  or  objects.  At  any  time,  only  one  such  area  is  selected.  The  feature 
maps  provide  current  input  to  the  saliency  map.  The  output  of  the  saliency 
map  consists  of  a  spike  train  from  neurons  corresponding  to  this  selected 
area  in  the  topographic  map  which  project  to  the  ventral  ("What") 
pathway. By this mechanism, attended stimuli are "tagged" through
modulation of the temporal structure of the corresponding neuronal
signals.

Once  all  relevant  features  have  been  computed  in  the  various  feature  maps, 
they have to be combined to yield the salience, i.e. a scalar quantity. In our
model,  we  solve  this  task  by  simply  adding  the  activities  in  the  different 
feature  maps  (reduced  to  the  size  of  the  saliency  map  using  a  Gaussian 
pyramid). 
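
The summation step can be sketched as below. This is an illustration under
assumptions rather than the original code: averaging over 2x2 blocks is used
as a simple stand-in for one level of a Gaussian pyramid, all maps are
assumed to be larger than the saliency map by a power of two, and the
function names are hypothetical.

```python
def downsample2(img):
    """Halve both dimensions by averaging 2x2 blocks.

    Assumption: this replaces the blur-and-subsample step of a Gaussian
    pyramid with the simplest possible equivalent.
    """
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)]
            for y in range(h)]


def reduce_to(img, height, width):
    """Repeatedly downsample until the map has the requested size."""
    while len(img) > height or len(img[0]) > width:
        img = downsample2(img)
    if len(img) != height or len(img[0]) != width:
        raise ValueError("map size is not a power-of-two multiple of the target")
    return img


def combine_feature_maps(feature_maps, height, width):
    """Saliency = plain sum of all feature maps at the saliency-map resolution."""
    saliency = [[0.0] * width for _ in range(height)]
    for fmap in feature_maps:
        small = reduce_to(fmap, height, width)
        for y in range(height):
            for x in range(width):
                saliency[y][x] += small[y][x]
    return saliency


if __name__ == "__main__":
    coarse = [[0.0, 2.0], [0.0, 0.0]]
    fine = [[1.0] * 4 for _ in range(4)]
    print(combine_feature_maps([coarse, fine], 2, 2))
```

Equal, fixed weights are used here because the text describes a plain sum;
weighted or learned combinations would replace the inner addition.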

At  any  given  time,  the  maximum  of  this  map  is  therefore  the  most  salient 
stimulus.  As  a  consequence,  this  is  the  stimulus  to  which  the  focus  of 
attention  should  be  directed  next  to  allow  more  detailed  inspection  by  the 
more powerful "higher" processes which are not available to the massively
parallel  feature  maps.  This  maximum  is  selected  by  application  of  a 
winner-take-all  mechanism. 
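
One way to picture the winner-take-all selection is as a race between
integrators: every unit accumulates its saliency input, and the first one to
reach threshold fires and defines the winner. The sketch below follows that
idea under simplifying assumptions (no leak, no explicit inhibitory
dynamics; the race simply ends at the first spike). The function name,
threshold, and time step are hypothetical.

```python
def winner_take_all(saliency, threshold=1.0, dt=0.01, max_steps=100000):
    """Return ((row, col), time) of the first integrator to reach threshold.

    Each unit integrates its saliency input without leak. Ending the race at
    the first spike plays the role of global inhibition in this sketch.
    Returns (None, None) if no unit fires within max_steps.
    """
    h, w = len(saliency), len(saliency[0])
    v = [[0.0] * w for _ in range(h)]      # membrane potentials
    for step in range(1, max_steps + 1):
        best = None
        for y in range(h):
            for x in range(w):
                v[y][x] += dt * saliency[y][x]
                if v[y][x] >= threshold and (
                        best is None or v[y][x] > v[best[0]][best[1]]):
                    best = (y, x)
        if best is not None:
            return best, step * dt
    return None, None


if __name__ == "__main__":
    smap = [[0.1, 0.5],
            [0.9, 0.2]]
    print(winner_take_all(smap))           # the unit with input 0.9 wins
```

In this formulation a more salient location is not only selected but also
selected sooner, since its integrator reaches threshold earlier.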

For a static image, the system as described so far would continuously attend
to the most conspicuous stimulus. This is neither observed in biological vision nor
desirable  from  a  functional  point  of  view;  instead,  after  inspection  of  any 
point,  there  is  usually  no  reason  to  dwell  on  it  any  longer  and  the  next-most 
salient  point  should  be  attended. 

We  achieve  this  behavior  by  introducing  feedback  from  the  winner-take-all 
array. When a spike occurs in the WTA network, the integrators in the
saliency map receive additional input with the spatial structure of an
inverted Mexican hat, i.e., a difference of Gaussians. The (inhibitory) center is at
the location of the winner, which thus becomes inhibited in the saliency
map and, consequently, attention switches to the next-most conspicuous
location. At the same time, the system avoids returning to the location
it has just visited, a phenomenon which is well known in psychophysics
under the term "inhibition of return."
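
The resulting scan behavior can be sketched by alternating selection of the
maximum with an inverted-Mexican-hat update centered on the winner. This is
a simplified illustration, not the model's dynamics: the inhibition is
applied once per fixation to a static map instead of being fed into
integrators, its strength is scaled by the winner's own value, and the
widths, amplitudes, and gain are assumed values chosen for the example.

```python
import math


def inverted_mexican_hat(dy, dx, sigma_center=1.0, sigma_surround=3.0,
                         k_center=1.0, k_surround=0.2):
    """Difference of Gaussians with a negative (inhibitory) center."""
    d2 = dy * dy + dx * dx
    center = k_center * math.exp(-d2 / (2.0 * sigma_center ** 2))
    surround = k_surround * math.exp(-d2 / (2.0 * sigma_surround ** 2))
    return surround - center


def scan_path(saliency, n_fixations, gain=1.5):
    """Visit locations in order of decreasing salience with inhibition of return."""
    s = [row[:] for row in saliency]       # work on a copy
    h, w = len(s), len(s[0])
    path = []
    for _ in range(n_fixations):
        wy, wx = max(((y, x) for y in range(h) for x in range(w)),
                     key=lambda p: s[p[0]][p[1]])
        path.append((wy, wx))
        peak = s[wy][wx]
        # Assumption: feedback strength proportional to the winner's value,
        # so the attended location is always pushed below zero.
        for y in range(h):
            for x in range(w):
                s[y][x] += gain * peak * inverted_mexican_hat(y - wy, x - wx)
    return path


if __name__ == "__main__":
    smap = [[0.0] * 9 for _ in range(9)]
    smap[1][1], smap[4][7], smap[7][2] = 1.0, 0.8, 0.6
    print(scan_path(smap, 3))
```

Because the sketch never lets the inhibition decay, it does not reproduce
revisits of a previously attended location; a time-dependent version would
let the inhibited region recover under continued input.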
Simulation Results


We  have  studied  the  system  with  inputs  constructed  analogously  to  typical 
visual  psychophysical  stimuli.  For  instance,  bright  spots  in  dark 
backgrounds (or dark spots in bright backgrounds) are reliably detected,
and  the  focus  of  attention  immediately  jumps  to  such  stimuli.  If  there  is 
more  than  one  such  stimulus,  the  system  scans  them  one-by-one,  in  the 
order  of  decreasing  contrast  from  the  background.  The  same  is  true  for 
stimuli  which  have  a  color  or  orientation  different  from  that  of  the 
background,  or  for  moving  stimuli  in  front  of  a  static  background.  In  all 
cases, the elements can have different sizes (i.e. different spatial
dimensions).  The  system  was  also  applied  to  images  of  natural  and 
artificial  environments  (recorded  with  a  commercial  NTSC  camcorder)  and 
was  in  general  successful  in  selecting  the  most  salient  portions  of  the 
scenes  (see  Fig.~2). 

Space  limitations  prevent  a  detailed  presentation  of  these  results  in  this 
report. Some of the results have been published already (Niebur and Koch,
1995); others are contained in a forthcoming publication.

Conclusion  And  Outlook 

We present in this final technical report a prototype for an integrated
system mimicking the control of visual selective attention. Our model is
compatible with the known anatomy and physiology of the primate visual
system, and its different parts communicate by signals which are neurally
plausible. The model identifies the most salient points in a visual scene
one-by-one  and  scans  the  scene  autonomously  in  the  order  of  decreasing 
saliency.  This  allows  the  control  of  a  subsequently  activated  processor 
which  is  specialized  for  detailed  object  recognition.  At  present,  saliency  is 
determined  by  combining  the  input  from  a  set  of  feature  maps  with  fixed 
weights.  In  future  work,  we  will  generalize  our  approach  by  introducing 
plasticity  in  these  weights  and  thus  adapting  the  system  to  the  task  at  hand. 

Partially  funded  by  a  Multidisciplinary  Research  Program  (MURI)  via  the 
Office  of  Naval  Research  (ONR),  we  are  now  porting  this  suite  of  programs 
onto Pentium-based machines for studying machine-vision applications. In
particular,  we  are  focusing  on  the  problem  of  very  rapid  identification  of 
particular  features  (e.g.  faces  or  weapons)  in  video  sequences  via  a 
saliency-map-based procedure. This involves the need to learn how the
various feature maps can be mapped onto the saliency map (in a
supervised or unsupervised manner) to assure optimal detection.
Figure 1: Overview of the model Where pathway. Features are computed
as  center-surround  differences  at  4  different  spatial  scales  (only  3  feature 
maps  shown).  They  are  combined  and  integrated  in  the  saliency  map 
("SM")  which  provides  input  to  an  array  of  integrate-and-fire  neurons  with 
global  inhibition.  This  array  ("WTA")  has  the  functionality  of  a  winner- 
take-all  network  and  provides  the  output  to  the  ventral  pathway  ("V2")  as 
well  as  feedback  to  the  saliency  map  (bold  arrow). 

Figure  2:  Example  of  the  performance  of  the  attentional  control  system. 
Input  to  the  system  is  a  picture  of  the  Caltech  bookstore,  here  reproduced  in 
gray-scale.  Top  left:  The  most  salient  point  in  the  image  is  a  red  banner  on 
the  wall  of  the  bookstore  (in  the  center  of  the  image).  Therefore,  this 
becomes the location to which the focus of attention is directed first (shown
as  white  square).  Top  left,  top  right,  bottom  left:  Trajectory  of  the  focus  of 
attention,  shown  as  dark  line,  at  the  simulated  times  listed  above  the 
respective  images.  The  focus  of  attention  shifts  about  every  35  msec  to  the 
next-most  salient  location.  Note  that  at  t=280  msec  (top  right),  the 
continuously  high  input  leads  to  a  re-focusing  on  the  location  attended  first 
(at t=140 msec). However, the system does not enter a loop, as is seen at
t=315  msec  (bottom  left).  Bottom  right:  After  t=540  msec,  the  system  has 
visited  11  locations  (shown  by  white  squares;  one  location  visited  twice)  and 
scanned  a  significant  portion  of  the  image.  Bottom  center:  Instantaneous 
activity  in  the  saliency  map  at  the  times  corresponding  to  the  images 
pointed to by the arrows. Note the dark "blobs" caused by the inhibitory
feedback  at  the  just-visited  location.  The  center-bottom  map  (without  arrow) 
shows  the  input  to  the  saliency  map,  i.e.  the  sum  of  all  feature  maps. 